Hello everyone and welcome again.
As part of a personal project, I decided to analyze and decipher several databases on subjects that I'm interested in, and to share the results I obtained with you.
Today we're gonna analyse a market that I've been personally investing in: the sneakers market. This market has grown tremendously over the past decade. Statista recently published a report indicating that the global sneakers market reached a size of $127.3 billion in 2021.
The current U.S. sneaker resale market is estimated at $2 billion, but is expected to grow roughly fifteen-fold to $30 billion by 2030. One of the main reasons: value appreciation. Sneakers are increasingly seen as an alternative asset.
The expansion of this market has come with the growth of new players such as GOAT or StockX.
Earlier in 2021, resale marketplace app GOAT raised $195 million in venture capital funding at a $3.7 billion valuation. The company claimed $2 billion in sales on the platform from mid-2020 to July 2021, with sneaker sales doubling.
Like GOAT, resale platform StockX was able to raise capital at a $3.8 billion valuation with IPO prospects later this year.
Today, we're going to analyse a StockX dataset and try to make predictions on it. Here are the questions we're gonna try to answer:
First, let's import the packages we're gonna need:
!pip install --upgrade matplotlib
!pip install colour
!pip install dython
import warnings
warnings.filterwarnings('ignore')
from colour import Color
from collections import OrderedDict
import pandas as pd
import numpy as np # linear algebra
import itertools # iteration utils
from scipy.interpolate import griddata # for 3d surface plot
import pickle
import plotly.express as px
import plotly.graph_objects as go
from plotly import subplots
import plotly.io as pio
df = pd.read_csv("StockX-Data-Contest-2019-3.csv")
df.head()
Currently the dataset consists of a single file of sales provided by StockX: roughly 100,000 shoe sales covering 50 different models, Yeezys and Nike x Off-White.
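If you want to check those numbers yourself, a quick look at the raw file is enough:
print(df.shape)                        # (number of sales, number of columns)
print(df["Brand"].value_counts())      # sales per brand
print(df["Sneaker Name"].nunique())    # number of distinct models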
The feature names are pretty self-explanatory and don't need much explanation. The first thing we are going to do is add new features that we're gonna use later in our study:
The first feature that could be interesting to have is the color of the shoe. The color of a product is often a decision criterion for customers, according to their preferences. The color is indicated in the "Sneaker Name" feature, but not clearly identified as an independent feature.
def check_color(color):
    """Return True if the word is a valid color name."""
    try:
        Color(color)
        return True
    except ValueError:
        return False

# Collect every color word found in each sneaker name
colors = []
for i in df["Sneaker Name"]:
    color = ""
    words = i.split("-")
    for word in words:
        if check_color(word):
            color += " " + word
    colors.append(color)

df["Colors"] = colors
df.head()
This obviously works when the color is in the sneaker's name, but that's not the case for all of them. Let's check how many blanks we have:
len(df.loc[df["Colors"]==""])
With this method, we have 46,314 blanks. That's half of our dataset. To fix this, we're gonna scrape the colors directly from the StockX website. For this, we need a list of all the different sneakers in our dataset. Here is our list:
df.pop('Colors')
df["Sneaker Name"] = df["Sneaker Name"].str.replace("-"," ")
allnames = df["Sneaker Name"].unique()
I won't detail here how I scraped the data from StockX, but I published a Medium article that you can access here. You can also get the whole code on my GitHub by clicking this. What I'll do right now is load the dictionary that contains all the sneaker colorways and extract the main colors out of it.
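Just to give a rough idea of what that scraper looks like (the URL pattern and the selector below are purely illustrative assumptions, not StockX's actual markup; the real code is in the Medium article and on GitHub), it boils down to fetching a page per model, reading its colorway, and pickling the resulting dictionary:
# Rough, illustrative sketch only -- the real scraper is linked above.
import requests
from bs4 import BeautifulSoup

final_dict = {}
for name in allnames:
    url = "https://stockx.com/search?s=" + name.replace(" ", "+")   # hypothetical endpoint
    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(page.text, "html.parser")
    tag = soup.select_one("[data-testid='colorway']")                # hypothetical selector
    final_dict[name] = tag.get_text() if tag else ""

with open("saved_dictionary.pkl", "wb") as f:
    pickle.dump(final_dict, f)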
with open('saved_dictionary.pkl', 'rb') as f:
    final_dict = pickle.load(f)

# Normalize the colorway strings: replace the separators with spaces
for key in final_dict.keys():
    final_dict[key] = final_dict[key].replace("/", " ")
    final_dict[key] = final_dict[key].replace("-", " ")
print(final_dict)
# Keep only the words of each colorway that are valid color names,
# then keep the first distinct one as the sneaker's main color
for i in final_dict.keys():
    color = ""
    words = final_dict[i].split(" ")
    for word in words:
        if check_color(word):
            color += " " + word
    color = ' '.join(OrderedDict.fromkeys(color.split())).split(" ")[0]
    final_dict[i] = color
print(final_dict)
# Map each sale to the main color of its sneaker
colors_list = []
for i in df["Sneaker Name"]:
    colors_list.append(final_dict[i])
df["Colors"] = colors_list
We still have a few sneakers that don't have a corresponding color:
df["Sneaker Name"].loc[df["Colors"]==""].unique()
For these 4 models, we will just fill in the corresponding color by hand.
final_dict['Nike Air Max 90 Off White'] = 'WHITE'
final_dict['adidas Yeezy Boost 350 V2 Butter'] = 'BEIGE'
final_dict['Adidas Yeezy Boost 350 V2 Sesame'] = 'GREY'
final_dict['Nike Air Max 90 Off White Desert Ore'] = 'TAN'
colors_list = []
for i in df["Sneaker Name"]:
    colors_list.append(final_dict[i])
df["Colors"] = colors_list
Now that we have our colors, let's create a new column that will represent the capital gain of our sneakers. This new feature will help us quantify how much value a certain sneaker gained through time.
# Clean the price columns: remove "$" and "," then convert to float
df['Sale Price'] = df['Sale Price'].str.replace("$", "", regex=False)
df['Sale Price'] = df['Sale Price'].str.replace(",", "", regex=False)
df['Retail Price'] = df['Retail Price'].str.replace("$", "", regex=False)
df['Retail Price'] = df['Retail Price'].str.replace(",", "", regex=False)
df['Retail Price'] = df['Retail Price'].astype(float)
df['Sale Price'] = df['Sale Price'].astype(float)

# Absolute and relative gain of each resale
df["Margin"] = df['Sale Price'] - df["Retail Price"]
df["Value_increase(%)"] = (df['Sale Price'] - df["Retail Price"]) * 100 / df["Retail Price"]
Another feature that could be interesting to add is the number of days between the order date and the release date. This will tell us whether it's better for a customer to buy a sneaker when it drops or later, and how the value fluctuates over time.
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])
# Number of days between the release of the sneaker and the resale order
df["Days_between_OR"] = (df['Order Date'] - df['Release Date']).dt.days
df
Now let's have a look at our features, how they're distributed and how they relate to each other.
px.bar(df, x='Order Date', y='Sale Price', color='Brand', color_discrete_map={'Yeezy':'black','Off-White': 'green'}, title='Volume of sales by sneaker brand')
This graph shows us the volume of sales by brand. It can also be read as an indicator of each brand's resale performance, or liquidity: the bigger the value, the more a pair of sneakers sells. We can see that, overall, the volume of Off-White sneakers sold is way bigger than the volume of Yeezy sneakers sold. We can also see two significant peaks on this graph, both between November and January. This is not surprising, since it corresponds to the Christmas and New Year period. Be careful though, this is not a profitability indicator: the fact that a pair of shoes sells more doesn't mean the margin is bigger. Let's evaluate how the profitability evolves over time:
px.bar(df, x='Order Date', y='Margin', color='Brand', color_discrete_map={'Yeezy':'black','Off-White': 'green'},title='Sum of profitability by sneaker brand')
Again, we can see that the margins are bigger at the end of the year. We can also see that the margin made on Off-White shoes is bigger than the margin made on Yeezys. Selling Off-White at the end of the year seems to be an interesting investment choice.
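If you want the exact numbers behind those two bar charts, a quick way is to sum the margins per month and per brand:
# Total margin per month and per brand
monthly_margin = (
    df.assign(Month=df['Order Date'].dt.to_period('M'))
      .groupby(['Month', 'Brand'])['Margin']
      .sum()
      .unstack('Brand')
)
print(monthly_margin)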
# Number of sales per buyer region (US state)
table = df.groupby(['Buyer Region'])['Sneaker Name'].count().reset_index()
# importing geopy library
from geopy.geocoders import Nominatim
# calling the Nominatim geocoding tool
loc = Nominatim(user_agent="GetLoc")
# geocode each region once and keep its latitude / longitude
locations = [loc.geocode(region) for region in table["Buyer Region"]]
table["Lat"] = [l.latitude for l in locations]
table["Long"] = [l.longitude for l in locations]
fig = px.scatter_geo(
    table, lon=table['Long'], lat=table['Lat'], color="Buyer Region",
    size="Sneaker Name", hover_name="Buyer Region", scope='usa',
    color_continuous_scale=[(0, "white"), (1, "green")]
)
fig.show()
Let's now have a look at the color variable we created. To what extent do the margins differ depending on the color of the sneaker?
# Total margin per sneaker color
margin_per_col = df.groupby("Colors").sum(numeric_only=True).reset_index()

fig = px.pie(
    margin_per_col, values='Margin', names='Colors', color="Colors",
    hole=0.5, color_discrete_map=dict(zip(df.Colors, df.Colors))
)
fig.update_layout(height=500, title='Color of the sneakers sold')
fig.show()
It seems like the biggest margins are made on neutral tones such as black, white or grey. We can also see that less common colors such as green or pink seem to be less profitable.
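To get the exact figures behind the pie chart, we can simply rank the colors by total margin:
# Colors ranked by the total margin they generated
print(margin_per_col[['Colors', 'Margin']].sort_values('Margin', ascending=False))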
As a reminder, the 2 questions we are trying to answer are the following :
Now that we have an idea of how our features behave, we're going to model the margin to see which features have an impact, and we're gonna make predictions on the sale price as a function of time. Let's see which features should go into our models.
Now that we have all our features, let's try to determine which ones are the most significant for a good profit margin. Let's start by computing the correlation between our features, which is a measure of association between two features, whether they're categorical or numerical. Features whose correlation is too high or too low will be removed.
As you can see, we have both categorical and numerical features in our dataset. This means we have to use specific methods to compute the correlation between two features. The first method we could use is Cramér's V, which is based on a nominal adaptation of Pearson's chi-square test. Unfortunately, this method loses part of the information because it is symmetrical: given two variables x and y, cramers_v(x, y) = cramers_v(y, x), even though the association between them is not necessarily symmetric.
We need an asymmetric measure of association between categorical/numerical features, and this is exactly what Theil's U is. Theil's U, also referred to as the Uncertainty Coefficient, is based on the conditional entropy between x and y. In plain language: given the value of x, how many possible states does y have, and how often do they occur?
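Concretely, if H(x) is the entropy of x and H(x|y) the conditional entropy of x given y, the coefficient computed below is U(x|y) = (H(x) - H(x|y)) / H(x): it equals 0 when y tells us nothing about x, and 1 when knowing y completely determines x.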
import numpy as np
import scipy.stats as ss
import seaborn as sns
from dython.nominal import conditional_entropy
from collections import Counter

def cramers_v(x, y):
    """Symmetric, bias-corrected measure of association between two categorical variables."""
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

def theils_u(x, y):
    """Asymmetric measure: how much knowing y reduces the uncertainty about x."""
    s_xy = conditional_entropy(x, y)
    x_counter = Counter(x)
    total_occurrences = sum(x_counter.values())
    p_x = list(map(lambda n: n/total_occurrences, x_counter.values()))
    s_x = ss.entropy(p_x)
    if s_x == 0:
        return 1
    else:
        return (s_x - s_xy) / s_x
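A quick way to see the asymmetry in action: the sneaker name completely determines the brand, but the brand doesn't determine the sneaker name, so the two directions give very different values:
print(theils_u(df["Brand"], df["Sneaker Name"]))   # ~1.0: the name fully determines the brand
print(theils_u(df["Sneaker Name"], df["Brand"]))   # much lower: the brand leaves many possible models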
# Theil's U between every pair of features:
# the cell at (row f, column c) is theils_u(df[c], df[f]), i.e. how much knowing f tells us about c
corr = pd.DataFrame()
corr["Features"] = df.columns
for col in df.columns:
    theils_u_list = []
    for ft in corr["Features"]:
        theils_u_list.append(theils_u(df[col], df[ft]))
    corr[col] = theils_u_list
corr = corr.set_index("Features")
Here's a heatmap that will help us visualize the correlations :
fig = px.imshow(corr, color_continuous_scale=[(0, "white"), (1, "green")])
fig.show()
We can see here that the features with the biggest impact on profitability seem to be the sneaker name, the brand, the retail price, the release date and the colors. On the other hand, the shoe size, the buyer region and the order date don't have much influence on profitability.
Once we have trained a model, we can apply further statistical analysis to understand the effect each feature has on the output of the model, and determine from this which features are most useful.
To illustrate feature importances, I will fit a random forest classifier and use the built-in feature importances from scikit-learn.
import datetime
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

def normalize_df(df):
    """Scale every column to the [0, 1] range."""
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

def dummy_encoding(df):
    """One-hot encode every object (categorical) column."""
    for col in df.columns:
        if df[col].dtype == object:
            df = pd.get_dummies(df, columns=[col], prefix=col, prefix_sep="_")
    return df

def factorize(df):
    """Alternative encoding: map each category of an object column to an integer."""
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = pd.factorize(df[col])[0]
    return df
target = "Margin"
rfc_df = df.copy()
#We want to consider the shoe size as an object
rfc_df['Shoe Size'] = rfc_df['Shoe Size'].astype(str)
#rfc_df['Retail Price'] = rfc_df['Retail Price'].astype(str)
rfc_df.pop('Order Date')
rfc_df.pop('Release Date')
rfc_df.pop('Days_between_OR')
rfc_df = dummy_encoding(rfc_df)
rfc_df = normalize_df(rfc_df)
from sklearn import preprocessing
from sklearn import utils
# encode the margin values as class labels for the classifier
lab = preprocessing.LabelEncoder()
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(rfc_df.drop(target, axis=1), lab.fit_transform(rfc_df[target]), test_size=0.20, random_state=0)
# fitting the model
model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42, max_depth = 15)
model.fit(X_train, y_train)
# plotting feature importances
features = rfc_df.drop(target, axis=1).columns
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,15))
plt.title('Feature Importances')
plt.barh(range(len(indices[-22:-2])), importances[indices[-22:-2]], color='g', align='center')
plt.yticks(range(len(indices[-22:-2])), [features[i] for i in indices[-22:-2]])
plt.xlabel('Relative Importance')
plt.show()
Without considering the numerical variables such as the resale price or the value increase, this graph shows us which modalities seem to be the most profitable. Here I kept the 20 modalities with the biggest feature importance once those numerical variables are set aside. We can see that the Yeezy Boost 350 V2 models seem to be very profitable for StockX. Again, let me explain the margin variable: it represents the sum of all margins made over all transactions. So even if the margin per pair is small, the total profit for StockX will be big if a lot of pairs are sold. We can also see that shoe sizes between 9 and 12 seem to make the biggest profit, as do basic colors such as black or white, and states such as California or New York.
In order to answer that question, we will group the sales by order date, and we will use this new DataFrame to build a time series forecast.
df_ts = df.groupby("Order Date")["Sale Price"].sum().reset_index()
df_ts.head(3)
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_ts["Order Date"], y=df_ts["Sale Price"], mode='lines', name='Sale Price'))
fig.data[0].line.color = "green"
fig.update_layout(
    title='Sale Price Performance',
    xaxis_title='Date', yaxis_title='Price'
)